Introduction to Preparing Private Data for RAG

RAG Fundamentals

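A standard large language model (LLM) is "frozen" at its training-data cutoff. As a result, it cannot answer questions about your company's internal manuals or yesterday's private video meeting. Retrieval-augmented generation (RAG) closes this gap by retrieving relevant context from your own private data and supplying it to the LLM.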

Multi-Step Workflow

To make private data "readable" to an LLM, you follow a specific pipeline:

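Load: Convert a variety of formats (PDF, web pages, YouTube, and so on) into a standard document format.
Split: Break long documents into small, manageable "chunks."
Embed: Convert each text chunk into a numeric vector (a mathematical representation of its meaning).
Store: Save these vectors in a vector store such as Chroma so that similarity searches run quickly.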
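The four stages above can be sketched end to end in plain Python. This is a toy illustration, not how any particular library implements them: the word-count "embedding" and the `ToyVectorStore` class are hypothetical stand-ins for a real embedding model and a vector store such as Chroma, and the sample documents are invented.

```python
import math
import re
from collections import Counter

def load(raw_sources):
    """Load: wrap each raw string in one common shape (text + metadata)."""
    return [{"text": text, "metadata": {"source": name}}
            for name, text in raw_sources.items()]

def split(docs):
    """Split: break each document into sentence-sized chunks."""
    chunks = []
    for doc in docs:
        for sentence in re.split(r"(?<=[.!?])\s+", doc["text"].strip()):
            if sentence:
                chunks.append({"text": sentence, "metadata": doc["metadata"]})
    return chunks

def embed(text):
    """Embed: toy word-count vector. A real model captures meaning,
    not just shared words (assumption made for illustration only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Store: hypothetical in-memory stand-in for a real vector store."""
    def __init__(self):
        self.rows = []

    def add(self, chunks):
        for chunk in chunks:
            self.rows.append((embed(chunk["text"]), chunk))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Invented sample data standing in for private documents.
raw_sources = {
    "handbook.txt": "Expense reports are due on the fifth of each month. "
                    "Laptops are replaced every three years.",
    "meeting.txt": "The team agreed to ship the beta in March. "
                   "Support hours move to weekdays only.",
}

store = ToyVectorStore()
store.add(split(load(raw_sources)))
top = store.search("When are expense reports due?", k=1)[0]
print(top["metadata"]["source"], "->", top["text"])
```

Only the retrieved chunk, not the full set of documents, would then be passed to the LLM alongside the question.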

Why Chunking Matters

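LLMs have a limit called the "context window," which caps how much text they can process at once. A 100-page PDF sent in one piece can exceed that limit and cause the request to fail. Splitting the data into chunks lets you send the model only the most relevant information.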
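As a rough illustration of chunking, the sketch below cuts text into fixed-size character windows that share a few characters with their neighbors. The function `split_with_overlap` is hypothetical; its parameter names `chunk_size` and `chunk_overlap` are chosen to mirror common splitter settings, and real splitters usually prefer sentence or paragraph boundaries over raw character offsets.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Because neighboring windows share their boundary characters, a thought that straddles a cut still appears whole in at least one chunk, at the cost of storing some text twice.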

Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

To reduce the total number of tokens used by the LLM.

To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.

To make the vector database store data faster.

Challenge: Preserving Context

Apply your knowledge to a real-world scenario.

You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."

Task

Which splitter would be best for keeping context like "Section Headers" intact?

Solution:
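MarkdownHeaderTextSplitter is the better fit when the transcript has been formatted with Markdown headers: it records each chunk's headers in the chunk metadata, which helps the retrieval system distinguish between different chapters or lectures. RecursiveCharacterTextSplitter tries to break on paragraph and line boundaries first, which keeps related text together, but it does not add header metadata on its own.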
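To make the idea of header metadata concrete, here is a minimal plain-Python sketch. It is not the library's implementation: `split_by_headers` is a hypothetical helper, and it assumes the transcript has already been formatted with Markdown-style headers such as `# Lecture 1`, which raw transcripts usually lack.

```python
import re

def split_by_headers(markdown_text):
    """Split text at Markdown headers and tag each chunk with its header."""
    chunks = []
    current_header = None
    buffer = []

    def flush():
        # Emit the text collected so far under the header that was active.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"text": body, "metadata": {"header": current_header}})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            current_header = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return chunks

# Invented sample transcript with headers added by hand.
transcript = (
    "# Lecture 1\n"
    "Gradient descent updates weights step by step.\n"
    "\n"
    "# Lecture 2\n"
    "Backpropagation computes the gradients.\n"
)

chunks = split_by_headers(transcript)
lecture_2 = [c for c in chunks if c["metadata"]["header"] == "Lecture 2"]
print(lecture_2[0]["text"])
```

With the header stored alongside each chunk, a retrieval step can filter or label results by lecture instead of relying on text similarity alone.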